2024 Society for Freshwater Science Conference
EPA (USA)
6/2/24
Please visit https://usepa.github.io/spatial.sfs2024/ to view the workshop’s accompanying workbook
If necessary, install and load R packages by visiting “Setup” in the workbook’s “Welcome” section
(Optional) Download the workshop slides (instructions in the workbook’s “Welcome” section)
Follow along and have fun!
Michael Dumelle (he/him/his) is a statistician for the United States Environmental Protection Agency (USEPA). He works primarily on facilitating the survey design and analysis of USEPA’s National Aquatic Resource Surveys (NARS), which characterize the condition of waters across the United States. His primary research interests are in spatial statistics, survey design, environmental and ecological applications, and software development.
Ryan Hill (he/him/his) is an aquatic ecologist with the U.S. EPA Office of Research and Development. He is interested in how watershed conditions drive differences in freshwater diversity and water quality across the United States. He has worked extensively with federal physical, chemical, and biological datasets to gain insights into the factors affecting water quality and biotic condition of freshwaters across the conterminous US. He has also worked to develop and distribute large datasets of geospatial watershed metrics of streams and lakes for the Agency (EPA’s StreamCat and LakeCat datasets).
Lara Jansen (she/her/hers) is an aquatic community ecologist and an ORISE postdoctoral fellow working on predictive models of benthic macroinvertebrate communities across the conterminous US in relation to watershed factors. Lara completed her PhD in Environmental Science at Portland State University in 2023, studying the drivers and dynamics of harmful algal blooms in mountain lakes with Dr. Angela Strecker. She obtained an MS in Natural Resource Sciences at Cal Poly Humboldt University with a thesis focused on the downstream impacts of dam flow regulation on benthic macroinvertebrate and algal communities.
Wade Boys (he/him/his) is a graduate student at the University of Arkansas broadly interested in understanding how aquatic ectotherms will respond to climate change, especially the role of phenotypic plasticity in adapting to a warming world. Wade is a firm believer that science is not finished until it is communicated. In addition to research, he finds great purpose in cultivating community and connecting science to our everyday experiences as humans.
The views expressed in this workshop are those of the authors and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency. Any mention of trade names, products, or services does not imply an endorsement by the U.S. government or the U.S. Environmental Protection Agency. The U.S. Environmental Protection Agency does not endorse any commercial products, services, or enterprises.
What is spmodel? spmodel is an R package to fit, summarize, and predict for a variety of spatial statistical models. Some of the things that spmodel can do include:
Why spmodel? There are many great spatial modeling packages in R. A few reasons to use spmodel for spatial analysis are that:
- spmodel syntax is similar to base R syntax for functions like lm(), glm(), summary(), and predict(), making the transition from fitting non-spatial models to spatial models relatively seamless.
- spmodel has capabilities that give the user significant control over the specific spatial model being fit.
- spmodel is compatible with other modern R packages like broom and sf.
Nonspatial linear models are incredibly flexible tools that quantify the relationship between a response variable and a set of explanatory variables
Examples of nonspatial linear models: multiple linear regression, analysis of variance (ANOVA), splines, polynomial regression, additive models, and mixed effect models
\[ \begin{split} \text{y} & = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon \\ i & = 1, 2, \dots, k \end{split} \]
Generalizing becomes
\[ \begin{split} \text{y}_j & = \beta_0 + \beta_1 x_{1, j} + \beta_2 x_{2, j} + \dots + \beta_k x_{k, j} + \epsilon_j \\ i & = 1, 2, \dots, k \\ j & = 1, 2, \dots, n \end{split} \]
Matrix notation:
\[ \mathbf{y} = \begin{bmatrix} \text{y}_1 \\ \text{y}_2 \\ \vdots \\ \text{y}_n \\ \end{bmatrix}, \mathbf{X} = \begin{bmatrix} 1 & x_{1, 1} & x_{2, 1} & \dots & x_{k, 1} \\ 1 & x_{1, 2} & x_{2, 2} & \dots & x_{k, 2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1, n} & x_{2, n} & \dots & x_{k, n} \\ \end{bmatrix}, \]
\[ \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \text{ and } \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}. \] The nonspatial linear model is then written as \[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]
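The matrix form above is exactly what lm() estimates. A small base-R sketch with simulated data (all variable names here are hypothetical) shows that lm() and the normal equations \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\) agree:

```r
# Simulate a small data set for a nonspatial linear model
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 0.5 * x1 - 1.2 * x2 + rnorm(n)

# Fit with lm()
fit <- lm(y ~ x1 + x2)

# Solve the normal equations directly
X <- cbind(1, x1, x2)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# The two approaches agree
all.equal(unname(coef(fit)), as.vector(beta_hat))
```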
Introduce yourself to one (or more) neighbor(s)! What kinds of data have you used statistical modeling for? Have you ever used linear models via lm() in R?
spmodel makes it straightforward to implement these models. The spatial linear model adds a spatially dependent random error \(\boldsymbol{\tau}\) to the nonspatial model:

\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}, \]
splm() and spautor()
We fit spatial linear models with the splm() function, illustrated here using the moss data. The moss data contains a variable for log zinc concentration (log_Zn) for moss samples collected near a mining road in Alaska.
| sideroad | log_dist2road | log_Zn | geometry |
|---|---|---|---|
| N | 2.68 | 7.33 | POINT (-413585.3 1997623) |
| N | 2.68 | 7.38 | POINT (-413585.3 1997623) |
| N | 2.54 | 7.58 | POINT (-415367.2 1996769) |
The splm() function shares syntactic structure with the lm() function and generally requires at least three arguments:
- formula: a formula that describes the relationship between the response variable (\(\mathbf{y}\)) and explanatory variables (\(\mathbf{X}\))
- data: a data.frame or sf object that contains the response variable, explanatory variables, and spatial information
- spcov_type: the spatial covariance type ("exponential", "Gaussian", "spherical", etc.; 17 total types)

For example:

spmod <- splm(formula = log_Zn ~ log_dist2road, data = moss,
              spcov_type = "exponential")
summary(spmod)
Call:
splm(formula = log_Zn ~ log_dist2road, data = moss, spcov_type = "exponential")
Residuals:
Min 1Q Median 3Q Max
-2.6801 -1.3606 -0.8103 -0.2485 1.1298
Coefficients (fixed):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.76825 0.25216 38.74 <2e-16 ***
log_dist2road -0.56287 0.02013 -27.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pseudo R-squared: 0.683
Coefficients (exponential spatial covariance):
de ie range
3.595e-01 7.897e-02 8.237e+03
The tidy() function tidies fixed effect model output into a convenient tibble (a special data.frame)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 9.77 0.252 38.7 0
2 log_dist2road -0.563 0.0201 -28.0 0
- When log_dist2road is zero, the average log_Zn value is 9.768
- A one-unit increase in log_dist2road is associated with an average decrease of 0.563 in log_Zn
It can also be used to tidy spatial covariance parameters
# A tibble: 3 × 3
term estimate is_known
<chr> <dbl> <lgl>
1 de 0.360 FALSE
2 ie 0.0790 FALSE
3 range 8237. FALSE
- de: the variance of the spatially dependent (correlated) random error
- ie: the variance of the spatially independent (uncorrelated) random error, often called the nugget
- range: a parameter that controls the distance-decay rate of the spatial covariance

The glance() function returns columns with the sample size (n), number of fixed effects (p), number of estimated covariance parameters (npar), optimization criteria minimum (value), AIC (AIC), AICc (AICc), log-likelihood (loglik), deviance (deviance), and pseudo R-squared (pseudo.r.squared)
The augment() function returns columns with
- .fitted, the fitted value, calculated from the estimated fixed effects in the model
- .resid, the raw residual (observed minus fitted)
- .hat, the Mahalanobis distance, a metric of leverage
- .cooksd, the Cook’s distance, a metric of influence
- .std.resid, the standardized residual

Simple feature collection with 365 features and 7 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -445884.1 ymin: 1929616 xmax: -383656.8 ymax: 2061414
Projected CRS: NAD83 / Alaska Albers
# A tibble: 365 × 8
log_Zn log_dist2road .fitted .resid .hat .cooksd .std.resid
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7.33 2.68 8.26 -0.928 0.0200 0.0142 1.18
2 7.38 2.68 8.26 -0.880 0.0200 0.0186 1.35
3 7.58 2.54 8.34 -0.755 0.0225 0.00482 0.647
4 7.63 2.97 8.09 -0.464 0.0197 0.0305 1.74
5 7.26 2.72 8.24 -0.977 0.0215 0.131 3.45
6 7.65 2.76 8.21 -0.568 0.0284 0.0521 1.89
7 7.59 2.30 8.47 -0.886 0.0300 0.0591 1.96
8 7.16 2.78 8.20 -1.05 0.0335 0.00334 0.439
9 7.19 2.93 8.12 -0.926 0.0378 0.0309 1.26
10 8.07 2.79 8.20 -0.123 0.0314 0.00847 0.723
# ℹ 355 more rows
# ℹ 1 more variable: geometry <POINT [m]>
Run the fitted(), residuals(), hatvalues(), cooks.distance(), and rstandard() functions and verify they return the same values as augment().
The plot() function can be used on a fitted model object to construct a few pre-specified plots of these model diagnostics. For example, the following code plots the Cook’s distance, a measure of influence, which quantifies each observation’s impact on model fit:
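A minimal sketch of such a call; the which value below is an assumption borrowed from plot.lm()-style numbering, so check ?plot.spmodel for the exact panel numbers:

```r
# Plot Cook's distance for the fitted spatial model
# (which = 4 assumed by analogy with plot.lm; confirm via ?plot.spmodel)
plot(spmod, which = 4)
```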
Powerful to combine diagnostics with spatial visualizations via augment()
Use spmodel’s plot function on the spmod object to construct a plot of the fitted spatial covariance vs spatial distance. To learn more about the options for spmodel’s plot function, run ?plot.spmodel.
splm() can also fit a nonspatial model by setting spcov_type = "none".
Using summary(), compare the output of the none model fit via splm() to a model fit via lm(). Do you get the same estimates and standard errors?
glances() lets us compare the fit of several models simultaneously

# A tibble: 2 × 10
model n p npar value AIC AICc logLik deviance pseudo.r.squared
<chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 spmod 365 2 3 367. 373. 373. -184. 363. 0.683
2 none 365 2 1 634. 636. 636. -317. 363. 0.671
By default, models are estimated using restricted maximum likelihood (estmethod = "reml"); maximum likelihood estimation is also available (estmethod = "ml"). Leave-one-out cross validation (loocv):
loocv() returns several useful fit statistics:
- bias: The average difference between the observed value and its leave-one-out prediction. This should be close to zero for well-fitting models.
- MSPE: The mean squared prediction error between the observed value and its leave-one-out prediction.
- RMSPE: The square root of MSPE.
- cor2: The squared correlation between the observed value and its leave-one-out prediction. Can be viewed as a “predictive” R-squared (can be compared across models).

# A tibble: 1 × 4
bias MSPE RMSPE cor2
<dbl> <dbl> <dbl> <dbl>
1 0.00655 0.111 0.333 0.886
# A tibble: 1 × 4
bias MSPE RMSPE cor2
<dbl> <dbl> <dbl> <dbl>
1 0.000644 0.324 0.569 0.667
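The two tables above can be produced with calls like the following (a sketch: spmod is the exponential model fit earlier, while the nonspatial object name is assumed here, not taken from the source):

```r
loocv(spmod)     # spatial (exponential) model
loocv(nonspmod)  # nonspatial model fit with spcov_type = "none"; name assumed
```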
The spatial model has lower (better) MSPE/RMSPE and higher (better) cor2. loocv() can also be used to compare two models fit using REML that have different fixed effect structures.

Next we use the moose data. spmodel includes a data set called moose (and a companion prediction data set, moose_preds). The moose data contains moose counts and moose presence for 218 spatial locations in Alaska.

| elev | strat | count | presence | geometry |
|---|---|---|---|---|
| 469 | L | 0 | 0 | POINT (293542.6 1541016) |
| 362 | L | 0 | 0 | POINT (298313.1 1533972) |
| 173 | M | 0 | 0 | POINT (281896.4 1532516) |
The moose_preds data contains spatial locations that were not surveyed, for which predictions are desired.

| elev | strat | geometry |
|---|---|---|
| 143.4000 | L | POINT (401239.6 1436192) |
| 324.4375 | L | POINT (352640.6 1490695) |
| 158.2632 | L | POINT (360954.9 1491590) |
In the model formula, elev * strat is shorthand for elev + strat + elev:strat
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.310 9.02 0.0344 0.973
2 elev 0.0141 0.00806 1.76 0.0792
3 stratM 6.93 2.26 3.07 0.00217
4 elev:stratM -0.0273 0.0130 -2.10 0.0357
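The four terms in the table match the columns base R expands elev * strat into; a small base-R sketch with hypothetical data:

```r
# Hypothetical mini data set mimicking moose's elev (numeric) and strat (factor)
d <- data.frame(elev = c(100, 200, 300, 400),
                strat = factor(c("L", "M", "L", "M")))

# elev * strat expands to intercept, elev, stratM, and elev:stratM columns
colnames(model.matrix(~ elev * strat, data = d))
# [1] "(Intercept)" "elev"        "stratM"      "elev:stratM"
```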
Predictions at the unsampled moose_preds locations can be obtained using predict() or augment():
Simple feature collection with 100 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 269386.2 ymin: 1418453 xmax: 419976.2 ymax: 1541763
Projected CRS: NAD83 / Alaska Albers
# A tibble: 100 × 4
elev strat .fitted geometry
* <dbl> <chr> <dbl> <POINT [m]>
1 143. L 3.45 (401239.6 1436192)
2 324. L 1.59 (352640.6 1490695)
3 158. L -0.267 (360954.9 1491590)
4 221. M 2.39 (291839.8 1466091)
5 209. M 7.62 (310991.9 1441630)
6 218. L -1.02 (304473.8 1512103)
7 127. L -1.23 (339011.1 1459318)
8 122. L -1.43 (342827.3 1463452)
9 191 L -0.239 (284453.8 1502837)
10 105. L 0.657 (391343.9 1483791)
# ℹ 90 more rows
We can construct a plot of the moose_preds predictions using the augment() output.
Examine the help file ?augment.spmodel or by visiting this link and create site-wise 99% prediction intervals for the unsampled locations found in moose_preds.
We use the spglm() function in spmodel to fit spatial generalized linear models for various model families (i.e., response distributions). The spatial generalized linear model can be written as
\[ g(\boldsymbol{\mu}) = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\tau} + \boldsymbol{\epsilon}, \]
| Distribution | Data Type | Link Function |
|---|---|---|
| Poisson | Count | Log |
| Negative Binomial | Count | Log |
| Binomial | Binary | Logit |
| Beta | Proportion | Logit |
| Gamma | Skewed | Log |
| Inverse Gaussian | Skewed | Log |
| elev | strat | count | presence | geometry |
|---|---|---|---|---|
| 468.9 | L | 0 | 0 | POINT (293542.6 1541016) |
| 362.3 | L | 0 | 0 | POINT (298313.1 1533972) |
| 172.8 | M | 0 | 0 | POINT (281896.4 1532516) |
The family argument can be binomial, beta, poisson, nbinomial, Gamma, or inverse.gaussian
spglm() shares syntactic structure with glm():
Call:
spglm(formula = count ~ elev * strat, family = poisson, data = moose,
spcov_type = "spherical")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4245 -0.7783 -0.3653 0.1531 0.5900
Coefficients (fixed):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.230575 0.958201 -2.328 0.019919 *
elev 0.007623 0.003129 2.437 0.014820 *
stratM 2.752234 0.782853 3.516 0.000439 ***
elev:stratM -0.010248 0.004472 -2.292 0.021928 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pseudo R-squared: 0.09573
Coefficients (spherical spatial covariance):
de ie range
3.892 1.163 51204.657
Coefficients (Dispersion for poisson family):
dispersion
1
Fit a spatial negative binomial model to the moose data with count as the response; elev, strat, and their interaction as predictors; and the "gaussian" spatial covariance function. Compare its fit to the spatial Poisson model above using glances() and loocv(). Which model is preferable based on AIC/AICc? What about leave-one-out MSPE?
Predictions at the unsampled moose_preds locations can again be obtained using predict() or augment() (by default on the link scale):
Simple feature collection with 100 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 269386.2 ymin: 1418453 xmax: 419976.2 ymax: 1541763
Projected CRS: NAD83 / Alaska Albers
# A tibble: 100 × 4
elev strat .fitted geometry
* <dbl> <chr> <dbl> <POINT [m]>
1 143. L 0.207 (401239.6 1436192)
2 324. L -0.0563 (352640.6 1490695)
3 158. L -1.24 (360954.9 1491590)
4 221. M -1.16 (291839.8 1466091)
5 209. M 1.78 (310991.9 1441630)
6 218. L -1.84 (304473.8 1512103)
7 127. L -2.80 (339011.1 1459318)
8 122. L -2.45 (342827.3 1463452)
9 191 L -0.409 (284453.8 1502837)
10 105. L -1.10 (391343.9 1483791)
# ℹ 90 more rows
Use spglm() to fit a spatial logistic regression model to the moose data using presence as the response variable and a Cauchy covariance function. Then, find the predicted probabilities that moose are present at the spatial locations in moose_preds (Hint: Use the type argument in predict() or augment()).
Maintaining all analyses within a single software (R) can greatly simplify your research workflow. In this section, we’ll cover the basics of doing GIS in R.
Vector data are commonly handled with the sf package, but we can represent these features in R without actually using GIS packages.
library(tidyverse)
library(ggplot2)
id <- c(1:5)
cities <- c('Ashland','Corvallis','Bend','Portland','Newport')
longitude <- c(-122.699, -123.275, -121.313, -122.670, -124.054)
latitude <- c(42.189, 44.57, 44.061, 45.523, 44.652)
population <- c(20062, 50297, 61362, 537557, 9603)
oregon_cities <- data.frame(id, cities, longitude, latitude, population)

So, is this sufficient for working with spatial data in R and doing spatial analysis? What are we missing?
If you have worked with vector data before, you may know that these data also usually have:
The simple features (sf) package provides simple features access for R
sf fits in within the “tidy” approach to data of Hadley Wickham’s tidyverse
The functions the sf package provides include:

  [1] "%>%"                          "as_Spatial"
[3] "dbDataType" "dbWriteTable"
[5] "gdal_addo" "gdal_create"
[7] "gdal_crs" "gdal_extract"
[9] "gdal_inv_geotransform" "gdal_metadata"
[11] "gdal_polygonize" "gdal_rasterize"
[13] "gdal_read" "gdal_read_mdim"
[15] "gdal_subdatasets" "gdal_utils"
[17] "gdal_write" "gdal_write_mdim"
[19] "get_key_pos" "NA_agr_"
[21] "NA_bbox_" "NA_crs_"
[23] "NA_m_range_" "NA_z_range_"
[25] "plot_sf" "rawToHex"
[27] "read_sf" "sf.colors"
[29] "sf_add_proj_units" "sf_extSoftVersion"
[31] "sf_proj_info" "sf_proj_network"
[33] "sf_proj_pipelines" "sf_proj_search_paths"
[35] "sf_project" "sf_use_s2"
[37] "st_agr" "st_agr<-"
[39] "st_area" "st_as_binary"
[41] "st_as_grob" "st_as_s2"
[43] "st_as_sf" "st_as_sfc"
[45] "st_as_text" "st_axis_order"
[47] "st_bbox" "st_bind_cols"
[49] "st_boundary" "st_break_antimeridian"
[51] "st_buffer" "st_can_transform"
[53] "st_cast" "st_centroid"
[55] "st_collection_extract" "st_combine"
[57] "st_concave_hull" "st_contains"
[59] "st_contains_properly" "st_convex_hull"
[61] "st_coordinates" "st_covered_by"
[63] "st_covers" "st_crop"
[65] "st_crosses" "st_crs"
[67] "st_crs<-" "st_delete"
[69] "st_difference" "st_dimension"
[71] "st_disjoint" "st_distance"
[73] "st_drivers" "st_drop_geometry"
[75] "st_equals" "st_equals_exact"
[77] "st_filter" "st_geometry"
[79] "st_geometry_type" "st_geometry<-"
[81] "st_geometrycollection" "st_graticule"
[83] "st_inscribed_circle" "st_interpolate_aw"
[85] "st_intersection" "st_intersects"
[87] "st_is" "st_is_empty"
[89] "st_is_longlat" "st_is_simple"
[91] "st_is_valid" "st_is_within_distance"
[93] "st_jitter" "st_join"
[95] "st_layers" "st_length"
[97] "st_line_interpolate" "st_line_merge"
[99] "st_line_project" "st_line_sample"
[101] "st_linestring" "st_m_range"
[103] "st_make_grid" "st_make_valid"
[105] "st_minimum_rotated_rectangle" "st_multilinestring"
[107] "st_multipoint" "st_multipolygon"
[109] "st_nearest_feature" "st_nearest_points"
[111] "st_node" "st_normalize"
[113] "st_overlaps" "st_perimeter"
[115] "st_point" "st_point_on_surface"
[117] "st_polygon" "st_polygonize"
[119] "st_precision" "st_precision<-"
[121] "st_read" "st_read_db"
[123] "st_relate" "st_reverse"
[125] "st_sample" "st_segmentize"
[127] "st_set_agr" "st_set_crs"
[129] "st_set_geometry" "st_set_precision"
[131] "st_sf" "st_sfc"
[133] "st_shift_longitude" "st_simplify"
[135] "st_snap" "st_sym_difference"
[137] "st_touches" "st_transform"
[139] "st_triangulate" "st_triangulate_constrained"
[141] "st_union" "st_viewport"
[143] "st_voronoi" "st_within"
[145] "st_wrap_dateline" "st_write"
[147] "st_write_db" "st_z_range"
[149] "st_zm" "vec_cast.sfc"
[151] "vec_ptype2.sfc" "write_sf"
Convert oregon_cities to a spatial (sf) object by specifying the coordinate columns and a coordinate reference system (crs):

oregon_cities <- oregon_cities %>%
st_as_sf(coords = c('longitude', 'latitude'), crs = 4269)
print(oregon_cities)

Simple feature collection with 5 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -124.054 ymin: 42.189 xmax: -121.313 ymax: 45.523
Geodetic CRS: NAD83
id cities population geometry
1 1 Ashland 20062 POINT (-122.699 42.189)
2 2 Corvallis 50297 POINT (-123.275 44.57)
3 3 Bend 61362 POINT (-121.313 44.061)
4 4 Portland 537557 POINT (-122.67 45.523)
5 5 Newport 9603 POINT (-124.054 44.652)
The oregon_cities object has now changed from being a standard data frame and includes features that are required for a true spatial object.
Latitude and longitude columns have been moved to a new column called "geometry" (sticky).
sf also has functionality to re-project and manipulate spatial objects.
oregon_cities is currently in degrees, but certain applications may require an equal area projection and length units, such as meters. See: epsg.org
With sf we can:
Transform oregon_cities to CRS 5070 (NAD83 / Conus Albers, an equal-area projection with meter units). Now let’s plot the transformed data…
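The re-projection itself is one line with sf (assuming the oregon_cities sf object created above; the output object name is hypothetical):

```r
library(sf)

# Project from geographic NAD83 (EPSG:4269, degrees) to
# NAD83 / Conus Albers (EPSG:5070, meters)
oregon_cities_albers <- st_transform(oregon_cities, crs = 5070)
```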
We can do GIS entirely in R without going back and forth to other GIS software. If you already analyze data in R, the leap to doing GIS here can be small: sf provides a large number of GIS functions, such as buffers, intersections, centroids, etc.
Example 1: Add 100 km buffer to cities
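A sketch of the buffer step, assuming the oregon_cities object from earlier; buffering is done after projecting to meter units so that dist = 100000 means 100 km:

```r
library(sf)

cities_buff <- oregon_cities %>%
  st_transform(crs = 5070) %>%   # equal-area CRS with meter units
  st_buffer(dist = 100000)       # 100 km buffer around each city
```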
Plot map of buffered cities…
Example 2: Split buffers into sub-units and calculate areas
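A sketch of one approach, assuming the cities_buff layer from Example 1: a self-intersection with st_intersection() splits overlapping buffers into unique non-overlapping pieces, and st_area() returns areas in the CRS units (m²):

```r
library(sf)
library(dplyr)

# Self-intersection splits overlapping buffers into unique pieces
pieces <- st_intersection(cities_buff)

# Area of each piece, converted from m^2 to km^2
pieces <- pieces %>%
  mutate(area_km2 = as.numeric(st_area(.)) / 1e6)
```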
Plot results:
The terra package is now the best package for working with rasters. Much like sf, terra has a large number of functions for working with raster data:
[1] "%in%" "activeCat" "activeCat<-"
[4] "add_box" "add_legend" "add<-"
[7] "addCats" "adjacent" "aggregate"
[10] "align" "all.equal" "allNA"
[13] "animate" "app" "approximate"
[16] "area" "Arith" "as.array"
[19] "as.bool" "as.contour" "as.data.frame"
[22] "as.factor" "as.int" "as.lines"
[25] "as.list" "as.matrix" "as.points"
[28] "as.polygons" "as.raster" "atan_2"
[31] "atan2" "autocor" "barplot"
[34] "blocks" "boundaries" "boxplot"
[37] "buffer" "cartogram" "catalyze"
[40] "categories" "cats" "cbind2"
[43] "cellFromRowCol" "cellFromRowColCombine" "cellFromXY"
[46] "cells" "cellSize" "centroids"
[49] "clamp" "clamp_ts" "classify"
[52] "clearance" "click" "colFromCell"
[55] "colFromX" "colorize" "coltab"
[58] "coltab<-" "combineGeoms" "compare"
[61] "Compare" "compareGeom" "concats"
[64] "contour" "convHull" "costDist"
[67] "countNA" "cover" "crds"
[70] "crop" "crosstab" "crs"
[73] "crs<-" "datatype" "deepcopy"
[76] "delaunay" "densify" "density"
[79] "depth" "depth<-" "describe"
[82] "diff" "direction" "disagg"
[85] "distance" "dots" "draw"
[88] "droplevels" "elongate" "emptyGeoms"
[91] "erase" "expanse" "ext"
[94] "ext<-" "extend" "extract"
[97] "extractAlong" "extractRange" "fileBlocksize"
[100] "fillHoles" "fillTime" "flip"
[103] "focal" "focal3D" "focalCor"
[106] "focalCpp" "focalMat" "focalPairs"
[109] "focalReg" "focalValues" "forceCCW"
[112] "free_RAM" "freq" "gaps"
[115] "gdal" "gdalCache" "geom"
[118] "geomtype" "getGDALconfig" "getTileExtents"
[121] "global" "graticule" "gridDist"
[124] "gridDistance" "halo" "has.colors"
[127] "has.RGB" "has.time" "hasMinMax"
[130] "hasValues" "head" "hist"
[133] "identical" "ifel" "image"
[136] "impose" "inext" "init"
[139] "inMemory" "inset" "interpIDW"
[142] "interpNear" "interpolate" "intersect"
[145] "is.bool" "is.empty" "is.factor"
[148] "is.int" "is.lines" "is.lonlat"
[151] "is.points" "is.polygons" "is.related"
[154] "is.rotated" "is.valid" "isFALSE"
[157] "isTRUE" "k_means" "lapp"
[160] "layerCor" "levels" "linearUnits"
[163] "lines" "logic" "Logic"
[166] "longnames" "longnames<-" "makeNodes"
[169] "makeTiles" "makeValid" "makeVRT"
[172] "map.pal" "mask" "match"
[175] "math" "Math" "Math2"
[178] "mean" "median" "mem_info"
[181] "merge" "mergeLines" "mergeTime"
[184] "meta" "metags" "metags<-"
[187] "minCircle" "minmax" "minRect"
[190] "modal" "mosaic" "na.omit"
[193] "NAflag" "NAflag<-" "names"
[196] "ncell" "ncol" "ncol<-"
[199] "nearby" "nearest" "nlyr"
[202] "nlyr<-" "noNA" "normalize.longitude"
[205] "north" "not.na" "nrow"
[208] "nrow<-" "nsrc" "origin"
[211] "origin<-" "pairs" "panel"
[214] "patches" "perim" "persp"
[217] "plet" "plot" "plotRGB"
[220] "points" "polys" "prcomp"
[223] "predict" "princomp" "project"
[226] "quantile" "query" "rangeFill"
[229] "rapp" "rast" "rasterize"
[232] "rasterizeGeom" "rasterizeWin" "rcl"
[235] "readRDS" "readStart" "readStop"
[238] "readValues" "rectify" "regress"
[241] "relate" "removeDupNodes" "res"
[244] "res<-" "resample" "rescale"
[247] "rev" "RGB" "RGB<-"
[250] "roll" "rotate" "round"
[253] "rowColCombine" "rowColFromCell" "rowFromCell"
[256] "rowFromY" "same.crs" "sapp"
[259] "saveRDS" "sbar" "scale"
[262] "scoff" "scoff<-" "sds"
[265] "segregate" "sel" "selectHighest"
[268] "selectRange" "serialize" "set.cats"
[271] "set.crs" "set.ext" "set.names"
[274] "set.RGB" "set.values" "setGDALconfig"
[277] "setMinMax" "setValues" "shade"
[280] "sharedPaths" "shift" "sieve"
[283] "simplifyGeom" "size" "snap"
[286] "sort" "sources" "spatSample"
[289] "spin" "split" "sprc"
[292] "stdev" "stretch" "subset"
[295] "subst" "summary" "Summary"
[298] "svc" "symdif" "t"
[301] "tail" "tapp" "terrain"
[304] "terraOptions" "text" "tighten"
[307] "time" "time<-" "timeInfo"
[310] "tmpFiles" "trans" "trim"
[313] "union" "unique" "units"
[316] "units<-" "unserialize" "unwrap"
[319] "update" "values" "values<-"
[322] "varnames" "varnames<-" "vect"
[325] "vector_layers" "viewshed" "voronoi"
[328] "vrt" "vrt_tiles" "weighted.mean"
[331] "where.max" "where.min" "which.lyr"
[334] "which.max" "which.min" "width"
[337] "window" "window<-" "wrap"
[340] "wrapCache" "writeCDF" "writeRaster"
[343] "writeStart" "writeStop" "writeValues"
[346] "writeVector" "xapp" "xFromCell"
[349] "xFromCol" "xmax" "xmax<-"
[352] "xmin" "xmin<-" "xres"
[355] "xyFromCell" "yFromCell" "yFromRow"
[358] "ymax" "ymax<-" "ymin"
[361] "ymin<-" "yres" "zonal"
[364] "zoom"
Create and define raster
class : SpatRaster
dimensions : 10, 10, 1 (nrow, ncol, nlyr)
resolution : 36, 18 (x, y)
extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84
source(s) : memory
name : lyr.1
min value : 0.01871994
max value : 0.99392609
Basic raster in R.
We can access data from specific locations within a raster
Rasters can be stacked which can make them very efficient to work with
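A sketch with terra showing creation (matching the 10 × 10 global raster shown earlier), point extraction, and stacking; the point coordinates are arbitrary and values are random, so printed numbers will differ:

```r
library(terra)

# 10 x 10 global raster like the one printed above
r <- rast(nrow = 10, ncol = 10, xmin = -180, xmax = 180,
          ymin = -90, ymax = 90)
values(r) <- runif(ncell(r))

# Access data at specific lon/lat locations
pts <- vect(data.frame(lon = c(0, 100), lat = c(0, 45)),
            geom = c("lon", "lat"), crs = "EPSG:4326")
extract(r, pts)

# Stack layers into a single multi-layer SpatRaster
r2 <- r * 2
s <- c(r, r2)
nlyr(s)  # 2
```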
We will use the FedData package, but numerous other packages exist to access data within and outside the U.S., such as the elevatr package for accessing elevation data around the world. We will walk through an example of extracting information from a raster using a polygon layer. To do this we will:
oregon_cities
- data('us.cities') from the maps library
- FedData package
1-2. Buffer city of your choice
The USGS’s StreamStats is an online service and map interface that allows users to navigate to a desired location and delineate a watershed boundary with the click of a mouse:
https://streamstats.usgs.gov/ss/
In addition to the map interface, the data are also accessible via an API:
It is also possible to replicate this functionality in R:
streamstats_ws <- function(state, longitude, latitude) {
  # Assemble the StreamStats API query string from its pieces
  p1 <- 'https://streamstats.usgs.gov/streamstatsservices/watershed.geojson?rcode='
  p2 <- '&xlocation='
  p3 <- '&ylocation='
  p4 <- '&crs=4269&includeparameters=false&includeflowtypes=false&includefeatures=true&simplify=true'
  query <- paste0(p1, state, p2, toString(longitude), p3, toString(latitude), p4)
  # Request the watershed and extract the polygon feature as an sf object
  mydata <- jsonlite::fromJSON(query, simplifyVector = FALSE, simplifyDataFrame = FALSE)
  poly_geojsonsting <- jsonlite::toJSON(mydata$featurecollection[[2]]$feature, auto_unbox = TRUE)
  poly <- geojsonio::geojson_sf(poly_geojsonsting)
  poly
}
# Define location for delineation (Calapooia Watershed)
state <- 'OR'
latitude <- 44.62892
longitude <- -123.13113
# Delineate watershed
cal_ws <- streamstats_ws(state, longitude, latitude) %>%
  st_transform(crs = 5070)

nhdplusTools is an R package that can access the Network Linked Data Index (NLDI) service, which provides navigation and extraction of NHDPlus data: https://doi-usgs.github.io/nhdplusTools/
nhdplusTools includes network navigation options as well as watershed delineation.

library(nhdplusTools)
# Simple feature option to generate point without any other attributes
cal_pt <- st_sfc(st_point(c(longitude, latitude)), crs = 4269)
# Identify the network location (NHDPlus common ID or COMID)
start_comid <- nhdplusTools::discover_nhdplus_id(cal_pt)
# Combine info into list (required by NLDI basin function)
ws_source <- list(featureSource = "comid", featureID = start_comid)
cal_ws2 <- nhdplusTools::get_nldi_basin(nldi_feature = ws_source)

nhdplusTools works by walking a network of pre-staged sub-catchments with unique IDs (COMIDs)
- whitebox R package
- elevatr, which can access DEM data from the web both within and outside the U.S.
- mapview::mapview to plot the watershed
# Logan River Watershed
latitude <- 41.707
longitude <- -111.855
# Define the lat/lon
start_point <- st_sfc(st_point(c(longitude, latitude)), crs = 4269)
# Find COMID of this point
start_comid <- nhdplusTools::discover_nhdplus_id(start_point)
# Create a list object that defines the feature source and starting COMID
ws_source <- list(featureSource = "comid", featureID = start_comid)
# Delineate basin
logan_ws <- nhdplusTools::get_nldi_basin(nldi_feature = ws_source)

Let’s revisit the Calapooia watershed and calculate percent row crop for 2019 with FedData
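A sketch of that FedData workflow, assuming the cal_ws polygon from the StreamStats example; NLCD class 82 is Cultivated Crops. Note the class codes/labels in the freq() output can vary by FedData/terra version, so check both:

```r
library(FedData)
library(terra)

# Download 2019 NLCD land cover for the Calapooia watershed polygon
nlcd <- get_nlcd(template = cal_ws, label = 'calapooia', year = 2019)

# Clip the raster to the watershed and tabulate land cover classes
v <- vect(cal_ws)
nlcd_ws <- mask(crop(nlcd, v), v)
f <- freq(nlcd_ws)

# Percent cultivated crops (class 82); value column may hold codes or labels
100 * sum(f$count[f$value %in% c(82, "Cultivated Crops")]) / sum(f$count)
```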
Now let’s extract the same information using the StreamCatTools package to access the online StreamCat database
The StreamCatTools package provides access to the same information with much less code because StreamCat pre-stages watershed metrics for each NHDPlus catchment. StreamCat also provides metrics at several scales.
For a subset of watershed metrics, summaries within ~100m of the stream segment are also available for catchments and watersheds:
sc_get_data(comid = comid,
metric = 'PctCrop2019',
aoi = 'catchment,riparian_catchment,watershed,riparian_watershed') %>%
  as.data.frame()

  COMID WSAREASQKM WSAREASQKMRP100 CATAREASQKM CATAREASQKMRP100
1 23763521 964.6101 136.17 8.4933 1.1097
PCTCROP2019CATRP100 PCTCROP2019WSRP100 PCTCROP2019CAT PCTCROP2019WS
1 9.08 8.4 36.59 10.49
StreamCatTools also makes it very easy to grab data for entire regions.
A few notes on StreamCatTools:
- Metrics are passed as a single comma-separated string, 'PctCrop2001, PctCrop2019', rather than a vector of metrics: c('PctCrop2001', 'PctCrop2019').
- Metric names are not case-sensitive, so 'pctcrop2001, pctcrop2019' or 'PCTCROP2001,PCTCROP2019' also work.
- StreamCat contains hundreds of metrics and we recommend consulting the metric list to identify those of interest for your study.
The R function to access LakeCat data was designed to parallel StreamCat functions
Let’s walk through an example together that:
Create an sf object of a sample point at Pelican Lake, WI:

library(nhdplusTools)
# Pelican Lake, WI
latitude <- 45.502840
longitude <- -89.198694
pelican_pt <- data.frame(site_name = 'Pelican Lake',
latitude = latitude,
longitude = longitude) %>%
st_as_sf(coords = c('longitude', 'latitude'), crs = 4326)
pelican_lake <- nhdplusTools::get_waterbodies(pelican_pt)
comid <- pelican_lake %>%
pull(comid)
lc_get_data(metric = 'elev, cao, sand, om, pctdecid2019',
aoi = 'watershed',
  comid = comid)

# A tibble: 1 × 7
COMID WSAREASQKM SANDWS ELEVWS CAOWS PCTDECID2019WS OMWS
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 167120863 41.0 59.6 491. 4.81 14.7 16.7
Query LakeCat to determine if the basins of these lakes had more deciduous or conifer forest in 2019:
“9028333,9025609,9025611,9025635”
The LakeCat metrics page may help
library(StreamCatTools)
comids <- "9028333,9025609,9025611,9025635"
lc_get_data(metric = 'pctdecid2019, pctconif2019',
aoi = 'watershed',
  comid = comids)

# A tibble: 4 × 4
COMID WSAREASQKM PCTDECID2019WS PCTCONIF2019WS
<dbl> <dbl> <dbl> <dbl>
1 9025609 5.27 18.1 0
2 9025611 0.447 19.3 0
3 9025635 13.6 22.3 0.339
4 9028333 6.62 34.5 2.32
We will now apply spmodel to model phenomena in two types of freshwaters:
Load required packages…
Read and prep table of lake conductivity values…
# Read in states to give some context
states <- tigris::states(cb = TRUE, progress_bar = FALSE) %>%
filter(!STUSPS %in% c('HI', 'PR', 'AK', 'MP', 'GU', 'AS', 'VI')) %>%
st_transform(crs = 5070)
# Read in lakes, select/massage columns, convert to spatial object
lake_cond <- fread('nla_obs.csv') %>%
select(UNIQUE_ID, COMID, COND_RESULT,
AREA_HA, DSGN_CYCLE,
XCOORD, YCOORD) %>%
mutate(DSGN_CYCLE = factor(DSGN_CYCLE)) %>%
st_as_sf(coords = c('XCOORD', 'YCOORD'),
crs = "+proj=aea +lat_0=37.5 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +type=crs") %>%
st_transform(crs = 5070)
Select sample sites within Minnesota
We will use similar watershed predictors to those in Dumelle et al. (2023).
Already included in the data table with the response variable are:
From LakeCat, we also will get the following predictor variables:
comids <- cond_mn$COMID
mn_lakecat <- lc_get_data(comid = comids,
metric = 'Tmean8110, Precip8110, S') %>%
select(COMID, TMEAN8110CAT, PRECIP8110WS, SWS)
mn_lakecat
# A tibble: 115 × 4
COMID TMEAN8110CAT PRECIP8110WS SWS
<dbl> <dbl> <dbl> <dbl>
1 1099282 7.69 825. 0.0365
2 1101812 7.91 812. 0.0329
3 1768378 3.51 696. 0.0985
4 1775870 4.28 749. 0.272
5 1776124 4.43 778. 0.272
6 2032039 7.23 794. 0.0712
7 2034063 7.25 845. 0.0712
8 2262633 5.58 768. 0.0712
9 2262729 5.62 786. 0.0329
10 2350690 6.71 735. 0.0712
# ℹ 105 more rows
In addition to these static LakeCat data, we would also like to pull in data from specific years of NLCD, matched to sample years, for the % of each watershed composed of crop area.
To do this we will need to:
crop <-
# Grab LakeCat crop data
lc_get_data(comid = comids,
aoi = 'watershed',
metric = 'pctcrop2006, pctcrop2011, pctcrop2016') %>%
# Remove watershed area from data
select(-WSAREASQKM) %>%
# Pivot table to long to create "year" column
pivot_longer(!COMID, names_to = 'tmpcol', values_to = 'PCTCROPWS') %>%
# Remove PCTCROP and WS to make "year" column
mutate(year = as.integer(
str_replace_all(tmpcol, 'PCTCROP|WS', ''))) %>%
# Add 1 to each year to match NLA years
mutate(year = factor(year + 1)) %>%
# Remove the tmp column
select(-tmpcol)
Now, join the tables to make our model data
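A hedged sketch of that join, assuming the Minnesota sites in cond_mn carry a year column (derived from DSGN_CYCLE) matching the crop table; names are inferred from the model summary that follows:

```r
# Join static LakeCat metrics by COMID, then the year-matched crop data
model_data <- cond_mn %>%
  left_join(mn_lakecat, by = 'COMID') %>%
  left_join(crop, by = c('COMID', 'year'))
```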
Define the formula
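One plausible formula, reconstructed from the coefficient table in the summary below; the log transform of conductivity is an assumption, not confirmed by the source:

```r
formula <- log(COND_RESULT) ~ AREA_HA + year + TMEAN8110CAT +
  PRECIP8110WS + PCTCROPWS + SWS
```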
Run a non-spatial and spatial model for comparison
cond_mod <- splm(formula = formula,
data = model_data,
spcov_type = 'none')
cond_spmod <- splm(formula = formula,
data = model_data,
spcov_type = 'exponential')
summary(cond_spmod)
Call:
splm(formula = formula, data = model_data, spcov_type = "exponential")
Residuals:
Min 1Q Median 3Q Max
-1.87192 -0.19602 0.02576 0.29597 1.28615
Coefficients (fixed):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.486e+00 8.813e-01 9.630 < 2e-16 ***
AREA_HA 2.394e-05 5.510e-05 0.435 0.6639
year2012 -1.270e-01 3.261e-02 -3.896 9.80e-05 ***
year2017 -9.871e-02 3.851e-02 -2.563 0.0104 *
TMEAN8110CAT 5.375e-01 6.740e-02 7.976 1.55e-15 ***
PRECIP8110WS -8.406e-03 1.313e-03 -6.404 1.51e-10 ***
PCTCROPWS 4.317e-03 2.767e-03 1.560 0.1188
SWS 1.110e+00 7.740e-01 1.435 0.1514
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pseudo R-squared: 0.5024
Coefficients (exponential spatial covariance):
de ie range
2.685e-01 1.371e-02 2.167e+04
Model comparison
Leave-one-out comparison
prd_mod <- spmodel::loocv(cond_mod, se.fit = TRUE, cv_predict = TRUE)
prd_spmod <- spmodel::loocv(cond_spmod, se.fit = TRUE, cv_predict = TRUE)
rbind(prd_mod %>% pluck('stats'),
prd_spmod %>% pluck('stats'))
# A tibble: 2 × 4
bias MSPE RMSPE cor2
<dbl> <dbl> <dbl> <dbl>
1 -0.00261 0.270 0.519 0.731
2 -0.00392 0.140 0.374 0.860
Join back to model data for mapping
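A sketch of that step, assuming we attach the cv_predict element that loocv() returned (the new column names are hypothetical):

```r
model_data <- model_data %>%
  mutate(pred_nonspatial = prd_mod$cv_predict,
         pred_spatial    = prd_spmod$cv_predict)
```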
In this example, we will model the occurrence of the Argia damselfly in several northeastern states.
To start this exercise, we'll introduce a new R package called finsyncR. This package was developed by US EPA scientists and academic collaborators.
finsyncR provides us with access to thousands of biological samples across the U.S. from both EPA and USGS sources (e.g., Rumschlag et al. 2023). We'll be using EPA data today.
Load additional packages
Next, use finsyncR to get genus-level macroinvertebrate data from just EPA and rarefy to a 300-organism count
The code also converts the data to occurrence data (1 = detect, 0 = non-detect) and sets a seed to make it reproducible.
macros <- getInvertData(dataType = "occur",
taxonLevel = "Genus",
agency = "EPA",
lifestage = FALSE,
rarefy = TRUE,
rarefyCount = 300,
sharedTaxa = FALSE,
seed = 1,
boatableStreams = TRUE)
finsyncR is running: Gathering, joining, and cleaning EPA raw data
finsyncR is running: Rarefying EPA data
finsyncR is running: Applying taxonomic fixes to EPA data
finsyncR is running: Finalizing data for output
finsyncR data synchronization complete
dim(macros)
[1] 6174  856
Clean data:
Parse collection dates (lubridate::date()) and convert presence/absence to a factor
Convert to an sf object and transform to EPSG:5070
# Flexible code so we could model another taxon
genus <- 'Argia'
taxon <- macros %>%
dplyr::select(SampleID,
ProjectLabel,
CollectionDate,
Latitude_dd,
Longitude_dd,
all_of(genus)) %>%
#filter(ProjectLabel != 'WSA') %>%
mutate(CollectionDate = date(CollectionDate),
presence =
as.factor(pull(., genus))) %>%
st_as_sf(coords = c('Longitude_dd', 'Latitude_dd'), crs = 4269) %>%
st_transform(crs = 5070)
For today's exercise, we'll narrow down the samples to the northeastern region of the U.S.
region <- states %>%
filter(STUSPS %in% c('VT', 'NH', 'ME', 'NY', 'RI',
'MA', 'CT', 'NJ', 'PA', 'DE'))
# Use region as spatial filter (sf::st_filter()) for taxon of interest
taxon_rg <- taxon %>%
st_filter(region) %>%
filter(ProjectLabel %in% c('NRSA1314', 'NRSA1819')) %>%
mutate(year = year(ymd(CollectionDate))) %>%
select(SampleID:CollectionDate, presence:year)
taxon_rg %>%
pull(presence) %>%
table()
0 1
485 65
nhdplusTools
The prism package requires that we set a download folder in our workspace. Here, we set it to "prism_data" inside of our "data" folder. It will create this folder if it does not already exist.
We will then use terra::extract() to pull temperature values at our sample points.
Don't worry - we'll walk through this together
library(prism)
# Get these years of PRISM
years <- c(2013, 2014, 2018, 2019)
# Set the PRISM directory (creates directory if not present)
prism_set_dl_dir("./data/prism_data", create = TRUE)
# Download monthly PRISM rasters (tmean)
get_prism_monthlys('tmean',
years = years,
mon = 7:8,
keepZip = FALSE)
# Create stack of downloaded PRISM rasters
tmn <- pd_stack(prism_archive_subset("tmean", "monthly",
                                     years = years,
                                     mon = 7:8))
# Extract tmean at sample points and massage data
tmn <- terra::extract(tmn,
# Transform taxon_rg to CRS of PRISM on the fly
taxon_rg %>%
st_transform(crs = st_crs(tmn))) %>%
# Add COMIDs to extracted values
data.frame(COMID = comid_vect, .) %>%
# Remove front and back text from PRISM year/month in names
rename_with( ~ stringr::str_replace_all(., 'PRISM_tmean_stable_4kmM3_|_bil', '')) %>%
# Pivot to long table and call the value column TMEANPRISMXXXXPT (XXXX indicates year)
pivot_longer(!COMID, names_to = 'year_month',
values_to = 'TMEANPRISMXXXXPT') %>%
# Create new column of year
mutate(year = year(ym(year_month))) %>%
# Average July and August temperatures
summarise(TMEANPRISMXXXXPT = mean(TMEANPRISMXXXXPT, na.rm = TRUE),
.by = c(COMID, year))
Combine Dependent and Independent Data
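A hedged sketch of the combine step, assuming comid_vect holds the COMID for each sample (as used in the extraction above) and that the StreamCat watershed predictors appearing in the model summary were pulled into a separate table (sc_preds here is hypothetical):

```r
# Attach COMIDs, join year-matched PRISM temperatures,
# then join StreamCat watershed predictors by COMID
model_data <- taxon_rg %>%
  mutate(COMID = comid_vect) %>%
  left_join(tmn, by = c('COMID', 'year')) %>%
  left_join(sc_preds, by = 'COMID')  # sc_preds: WSAREASQKM, ELEVWS, etc.
```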
Model formulation
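A formula reconstructed from the spglm() coefficient table below; the I(log10(WSAREASQKM)) term matches the summary exactly, and the remaining terms mirror its fixed effects:

```r
formula <- presence ~ I(log10(WSAREASQKM)) + ELEVWS + WETINDEXWS +
  BFIWS + PRECIP8110WS + TMEANPRISMXXXXPT
```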
Run non-spatial and spatial models for comparison
bin_mod <- spglm(formula = formula,
data = model_data,
family = 'binomial',
spcov_type = 'none')
bin_spmod <- spglm(formula = formula,
data = model_data,
family = 'binomial',
spcov_type = 'exponential')
summary(bin_spmod)
Call:
spglm(formula = formula, family = "binomial", data = model_data,
spcov_type = "exponential")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4200 -0.4453 -0.3030 -0.1717 2.6018
Coefficients (fixed):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.603140 5.862970 1.638 0.10144
I(log10(WSAREASQKM)) 0.760753 0.190182 4.000 6.33e-05 ***
ELEVWS -0.007621 0.002402 -3.172 0.00151 **
WETINDEXWS 0.001087 0.003466 0.314 0.75384
BFIWS -0.050563 0.039372 -1.284 0.19906
PRECIP8110WS -0.002916 0.002653 -1.099 0.27172
TMEANPRISMXXXXPT -0.300855 0.140418 -2.143 0.03215 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pseudo R-squared: 0.09293
Coefficients (exponential spatial covariance):
de ie range
2.458e+00 1.210e-02 3.703e+04
Coefficients (Dispersion for binomial family):
dispersion
1
# Function to convert from log odds to probability
to_prob <- function(x) exp(x)/(1+exp(x))
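Note that to_prob() is the inverse-logit; base R's plogis() computes the same quantity and is numerically safer for large negative inputs:

```r
to_prob <- function(x) exp(x) / (1 + exp(x))
all.equal(to_prob(1.5), plogis(1.5))  # TRUE
```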
# loocv of non-spatial model
loocv_mod <- loocv(bin_mod, cv_predict = TRUE, se.fit = TRUE)
# loocv of spatial model
loocv_spmod <- loocv(bin_spmod, cv_predict = TRUE, se.fit = TRUE)
pROC::auc(model_data$presence, loocv_mod$cv_predict)
Area under the curve: 0.7765
pROC::auc(model_data$presence, loocv_spmod$cv_predict)
Area under the curve: 0.9085
Extract values from loocv() object and join to model data for mapping
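A hedged sketch of that final step, assuming the binomial cv_predict values are on the link (log-odds) scale and so need to_prob() to become probabilities (column names are hypothetical):

```r
model_data <- model_data %>%
  mutate(prob_nonspatial = to_prob(loocv_mod$cv_predict),
         prob_spatial    = to_prob(loocv_spmod$cv_predict))
```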